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Splines  in  Statistics 


ABSTRACT.  Spline  functions  are  particularly  appropriate  in  fitting  a 
smooth  non-parametric  model  to  noisy  data.  The  use  of  spline  functions  in 
non-parametric  density  estimation  and  spectral  estimation  is  surveyed.  The 
requisite  spline  theory  background  is  also  developed. 


Key  Words  and  Phrases.  Spline,  Cubic  Spline,  Interpolating  Spline, 
Smoothing  Spline,  Density  Estimation,  Spectral  Density  Estimation 
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Splines  in  Statistics 


§1  General  Introduction 

Statistics  (as  we  know  it)  began  with  fitting  parametric  models  to  data. 
Various  principles  for  making  estimates  and  inferences  were  developed  and 
refined  until  their  efficiencies  reached  their  (asymptotic)  limits  with 
methods  such  as  maximum  likelihood,  minimum  variance,  and  likelihood  ratio. 
Attention  returned  to  the  basic  models  and  there  appeared  a growing  realization 
that  not  all  of  the  parametric  structure  was  needed  to  make  inferences.  From 
this  idea  arose  the  techniques  variously  known  as  distribution  free  or  non- 
parametric.  At  the  cost  of  some  loss  of  efficiency  in  certain  instances, 
these  methods  prevented  model  violations  from  being  reflected  in  false  infer- 
ences. Although  non-parametric  methods  have  had  some  remarkable  success  (for 
example,  the  theory  of  rank  tests)  they  all  too  often  ignore  useful  non-statis- 
tical  information  that  may  be  present,  and  as  a result  lose  efficiency. 

The  classical  method  of  incorporating  non-statistical  information  is  by 
means  of  the  Bayesian  framework.  In  the  absence  of  any  canonical  methods  of 
determining  and  assessing  priors,  this  has  to  be  regarded  with  suspicion. 
Happily,  there  are  at  least  three  ways  to  incorporate  validly  non-statistical 
knowledge  in  inference  procedures.  We  discuss  each  in  turn. 

The  first  occurs  when  it  is  realized  that  the  predominantly  normal  data 
contains  a certain  amount  of  contamination,  i.e.  the  normal  model  is  roughly 
correct.  This  knowledge  may  come  from  central  limit  type  considerations.  The 
robustness  methods  of  Huber  (1964)  and  Hampel  (1974)  exploit  this  knowledge. 

If  a system  is  known  to  have  an  order  structure,  this  knowledge  may  be 

% 

exploited  by  methods  known  as  isotonic  inference.  The  book  by  Barlow,  et  al 
(1972)  reviews  most  of  these  techniques.  Knowledge  of  order  often  follows  from 
elementary  consideration  of  the  structure  of  the  system  being  modelled. 
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The  third  sort  of  non-statistical  knowledge  we  can  use  is  knowledge  of 
smoothness.  The  least  action  principle  of  dynamics  suggests  that  nature  makes 
things  change  as  smoothly  as  possible.  Our  main  concern  in  this  article  will 
be  with  a certain  measure  of  smoothness  and  the  consequences  of  making  the 
underlying  function  as  smooth  as  possible.  A function  which  optimizes  a 
smoothness  criterion  is  a spline.  Perhaps  a little  ironically  spline  methods 
also  provide  a sound  route  for  Bayesians  to  re-enter  the  scene,  as  Kimeldorf 
and  Wahba  (1970  and  1971)  show  there  of’o.)  exist  a Bayesian  model  wh^ch  gives 
a smoothing  spline  as  the  posterior  mean  in  that  model,  given  the  data.  Models 
which  require  an  order  structure  as  well  as  smoothness  lead  us  to  consider 
isotonic  splines.  Some  original  results  in  this  area  are  presented  in  Wright 
(1977). 

The  present  account  is  organized  as  follows:  We  begin  from  first  prin- 
ciples and  develop  those  parts  of  spline  theory  which  have  proved  most  relevant 
to  the  recent  applications  in  statistics.  We  then  review  the  literature 
associated  with  the  main  applications  of  splines  to  statistical  problems,  ending 
with  some  general  remarks  on  isotonic  splines. 

We  shall  reserve  the  symbol  D for  the  differentiation  operator  and  the 
symbol  I2  for  the  set  of  measurable  square  integrable  functions  on  the  interval 
[0,1].  The  symbol  W will  denote  the  set  of  functions  f on  [0,1]  for  which  D^f 
is  absolutely  continuous  for  j » 0,1,..., m-1  and  D^'f  is  in  L2  When  we 
occasionally  consider  functions  with  domain  other  than  [0,1]  the  relevant  domain 
will  be  shown  after  the  function  space  symbol  above,  e.g.  W^(-<»,»). 

§2  Classical  Spline  Theory 

The  spline  is  the  engineer's  solution  to  a problem  -frequently  concerning 
engineers.  The  problem  is  to  fit  a curve  through  points  i = l,2,...,n 


in  the  plane.  However,  the  engineer  often  needs  to  obtain  values  of  the  first 
and  second  derive v.ives  of  the  underlying  function  from  this  fitted  curve.  The 
spline,  an  optimal  solution  of  this  problem,  uses  the  analogs  with  weightless 
beans,  port  of  the  engineer's  stock  in  trade.  A precise  account  of  the 
engineer's  spline  follows. 

Let  {(tj,yj^);  i = l,...,n}  be  the  (error-free)  points  in  the  plane  for 
which  we  seek  an  inr.erpol ant . Calx  A * ^ ^>2  ^ ^’3  ^ 

a mesh-  For  computational  considerations  the  mesh  will  normally  just  be  the 
numbers  (t.:  i « l,...,n}. 

A (cubic)  spline  with  mesh  A,  written  S^(f)>  is  a function  with  contin- 
uous deriv/atives  up  -o  (and  including)  order  2 which  coir: Ides  exactly  with  a 
(possibly  different)  cubic  function  on  each  interval  i * 1,...,N-1. 

The  points  i --  2,...,N-1}  are  called  the  knots  of  cvie  spline.  A spline 
S^(t)  which  in  addition  satisfies  = Xj.  i = 1.2,..  n is  an  interpolant 

for  the  data,  f o.tie  f>irther  restriction  is  needed  in  order  to  i;  Ae  this  inter- 
polant unique  on  Although  for  certain  applications,  other  end  condi- 

tions are  more  coiivenient,  the  most  natural  condition  i» 

This  corresponds  to  giving  the  analogous  beam  cantilevered  ends  (protruding 
beyond  the  end  poivts)  and  also  minimizes  the  "energy  of  flexion"  of  the  beam 
(i.e.  the  mean  square  curvature).  The  proofs  of  these  various  properties  are 
established  in  a more  general  context  by  various  contributors  to  the  theory  of 

L-splines.  See  Ahlberg,  Nilson  and  Hfalsh  (1967).  Notice  that  if  the  spline 

3 2 

between  (tj.y^)  and  equation  y = a^t  + b^t  + cA  + d^,  the 

continuity  of  the  lower  derivatives  ensures  that  b^,  c^^,  d^  are  constants, 
independent  of  i. 
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Since  curve  fitting  is  a very  practical  task,  it  is  against  nuaerical 
considerations  that  a curve  fitting  method  must  be  judged.  The  ease  with  which 
a cubic  spline  is  fitted  must  have  contributed  to  its  popularity. 

We  now  sketch  the  main  steps  in  fitting  a cubic  spline  to  data.  Suppose 
data  is  {(t^.y^^),  i » l,2,...,n}.  Let  » t^^j  - t^  and  take  ■ value  of 

second  derivacive  of  interpolating  spline  S^  at  t^,  i » l,2,...,n. 

Siy>pose  the  polynomial  interpolating  (t^.y^)  and 

2.1  y * a^(t-tj)^  + b^(t-t.)^  + c^Ct-t^)  + d. 

then 


2.2 


' b.  - kl./2 

a.  - (M.^j  - Mp/6hi 

y.  , - y,  2(h.M.  + b.M.^,) 
'i+l  'i  11  1 i+1 

‘^i IT 6 

di  - yi 


Thus  our  curve  fitting  problem  reduces  to  that  of  finding  the  values  of  M^. 

The  equations  relating  the  are  obtained  by  using  the  continuity  of  the  first 
derivative  of  the  spline,  Along  with  the  relations  2.2  to  give 


“i-l  Mi»l  * * 2h,)M^  . = 6(-^  - ^ 


i 'i-i. 


i-l 


for  i * 2,3, . . . ,n-l . 


VWf.  wi'  *lK'»1ffWnBiFrr»  Tni»<;  ytTi/ii 


I Our  denand  that  > 0 leads  iflwedlately  to  a tridiagonal  system  of  linear 

equations  for  ..  This  system  can  be  easily  solved  by  Gaussian  elimi- 

it  11“  1 

nation.  So  eaily  in  fact,  that  fitting  an  interpolating  spline  to  40  points 
is  feasible  with  only  a simple  calculator. 

It  was  soon  realized  that  the  cubic  spline  was  the  solution  of  an  optlmi- 

I 

i zing  problem  which  could  be  easily  generalized.  Let  L be  a differential 

' operator  with  constant  coefficients  of  order  m (usually  o'”}  and  let 

i 

I {(tjjXi):  i “ l,...,n}  be  data  points.  The  problem 

I ^ i 

\ 

■ \ 

I minimize 

I I 

subject  to 

■ ' 

f and 

I 
I 

I 

has  a solution  f(t)  which  satisfies  L*Lf(t)  » 0 in  the  intervals  between  knot 
points,  where  L*  is  the  adjoint  operator  to  L.  We  call  such  a solution  an 
"Interpolating  L-Spline".  There  are  accounts  of  such  splines  in  the  books  by 
I Ahlberg,  Nilson  and  Walsh  (1967)  and  T.  N.  E.  Greville  (1969).  However,  for 

I 

I statistical  purposes  another  type  of  spline  turns  out  to  be  more  useful. 


e j " 0,1, ...,m 

f(tj)  ® ^ “ l,...,n 


§3  Smoothing  Spline  Theory 

For  most  applications  in  statistics  the  smoothing  spline  is  much  more 
useful  than  the  interpolating  spline.  This  is  because  most  real-life  data  is 
subject  to  error  be  it  from  sampling,  measurement  or  other  sources.  There  are 
two  main  spline  fitting  methods  in  common  use  corresponding  to  different  ways 
of  dealing  with  the  "noise"  in  the  data.  Because  this  data  does  not  constrain 
the  fitted  function  nearly  as  firmly  as  in  the  interpolating  spline  case,  the 
fitting  requires  a genuine  optimization  routine,  not  just  the  simple  solution 
of  a linear  system  of  equations  as  with  the  cubic  interpolating  spline. 


The  first,  more  frequently  used  method,  parallels  the  least  squares  curve 
fitting  procedure  by  minimizing  a criterion  depending  on  squares  of  deviations 
from  data  points  and  on  the  "roughness"  of  the  fitted  curve.  When  we  have 
little  or  no  knowledge  of  the  magnitude  of  possible  errors  in  our  data  this 
method  is  the  appropriate  one  to  use. 

On  the  other  hand  when  the  data  points  are,  for  example,  direct  readings 
frcHU  a calibrated  instrument,  we  may  be  able  to  set  fairly  narrow  100%  confi- 
dence limits  for  each  data  point.  The  second  method  is  used  in  these  circum- 
stances. For  this  we  need  to  replace  the  ordinate  y in  the  two  dimensional 
data  with  a 100%  confidence  interval,  and  constrain  the  fitted  spline  function 
to  pass  through  all  of  these  intervals.  This  is  accomplished  in  practice  by 
using  Ml  optimization  routine  to  minimize  the  (convex)  roughness  criterion, 
subject  to  the  linear  constraints.  It  will  be  noticed  that  this  method  attaches 
considerable  importance  to  outliers,  rather  than  largely  ignoring  them. 


Fipet  Method  of  Fitting  Smoothing  Splinee 
Suppose  the  t values  of  the  data  lie  in  a finite  interval  say  [0,1]  and 
we  have  0 < tj  < t2  < • . • < t^  < 1 . Fitting  the  spline  leads  us  to  solving 
the  following  problem. 


3.1 


Minimize 


I U(t.)-y.)^  + X /J  dt 


Subject  to  f e W^,  \ fixed  > 0. 


The  solution  is  given  explicitly  in  the  paper  of  Kimeldorf  and  Wahba  (1970) 
and  as  expected  turns  out  to  be  a polynomial  spline  of  degree  2m-l  with  pos- 
sible knots  at  the  data  points.  As  so  often  happens,  this  theoretical  solution 
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cannot  be  used  as  an  algorithm  in  any  realistic  practical  case.  When  the  t 
values  are  evenly  spaced  throughout  [0,1],  and  f is  periodic,  Cogburn  and 
Davis  (1974)  show  how  to  do  the  fitting  more  easily.  Apart  from  this  one  happy 
instance,  a heavy  ortimization  is  invariably  required. 

Notice  that  the  number  X > 0 in  3.1  controls  the  amount  of  smoothing;  X 
too  small  results  in  overfitting  and  insufficient  removal  of  noise,  whereas  X 
too  large  results  in  underfitting  and  removal  of  much  of  the  wanted  signal  with 
the  noise.  Clearly  the  correct  choice  of  X is  of  the  greatest  importance.  A 
satisfactory  solution  to  this  problem  is  given  by  Wahba  and  Wold  (1975), 
although  not  all  theoretical  consequences  are  yet  developed. 


Second  Method  of  Fitting  Smoothinq  Splines 


From  what  was  written  in  the  preamble  of  this  section,  the  reader  will 
have  seen  that  this  spline  is  a cross  between  the  interpolating  spline  and  our 
first  smoothing  spline.  For  this  reason  the  fitting  technique  is  also  known 
as  (the  solution  procedure  for)  the  Generalised  Hermite-Birkhoff  Interpolation 
Problem-  (GHB  problem) . 

Let  be  the  100%  confidence  interval  for  the  ordinate  at  t^  (with 

< 6j) . The  GHB  problem  is 


Minimize 


/J  (f^^b^  dt 


subject  to  the  constraints  f e W , a.  < f(t.)  < 3. 

m'  1 - 1 - 1 

for  i = 1,2, ... ,n. 
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Various  recent  contributions  to  the  theory  of  such  splines  ha«i’e  been 
made  by  M.  Attala  (1968),  P.  J.  Laurent  (1969),  and  K.  Ritter  (1969).  Because 
Hilbert  space  methods  are  directly  applicable,  this  spline  has  been  more 
thoroughly  theoretically  analyzed  than  the  first  smoothing  spline. 

For  the  record,  the  solution  of  3.2  is  a spline  of  degree  2m-l  with  knots 
at  those  data  points  where  the  constraints  are  active.. 


S4  Some  Simplifications 

For  most  practical  applications  a slightly  sub-optimum  solution  may  be 
preferable  to  the  optimum  if  it  involves  a great  deal  less  computation  effort. 
We  have  already  seen  that  not  all  of  the  data  points  are  knot  points  for  the 
smoothing  spline.  By  using  the  following  guidelines  the  old  statistical  virtue 
of  eyeballing  the  data  can  be  converted  to  considerable  computational  advantage. 
Knot  Point  Selection  (Cubic  smoothing  spline). 

1)  Knot  points  should  be  at  data  points. 

2)  Try  to  have  at  least  4 or  5 data  points  between  knots. 

3)  Have  not  more  than  one  extremum  and  one  inflexion  point  between  knots. 

4)  Have  extrema  centered  in  intervals  and  inflexion  point  near  knots. 

For  more  details  see  Wold  (1974) . 


The  form  of  t'xe  optimal  spline  function  f. 

Suppose  data  points  (t.  , t.  , ...,  t.  } have  been  chosen  as  knots. 

^1  ^2  2 
the  optimal  spline  function  will  be  f such  that  f(t)  = + a^t  + a2t'^  + 


(t 


t. 

1. 


) 


J 


3 

+ 


where 


(t  - c) 


3 

+ 


= 0 

= (t  - c) 


3 


Then 

J 


t < c 
t > c . 


-’Vimrm 
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§5  Bayesian  Estiattion  Again 

We  are  now  in  a position  to  make  our  remarks  about  the  equivalence  of 

smoothness  and  Bayesian  posterior  means  precise. 

Let  L « be  a differential  operator  and  let  B * * 

-1  ik 

positive  definite  matrix.  Suppose  B ■■  [b*^  ] . 


Problem  I 


Find  f € W which  minimizes 

n 

I (fCtj)  - )'.)b'''Cf(t  ) - yj  . (Lf)^  dt 

).y  ‘ 


Problem  11 


Find  f(t)  with  f(t)  = E(x(t)ly(tj),  yUp.  ....  y(tp) 

where  y.  = x(t.)  ♦ e.  with  e.  N(0,B)  and  x(t)  is  a 
'j  J J J 

stationary  Gaussian  process  with  mean  zero  and  spectral 
density  f(A)  = ^ "[pHyp  ' ^j.O 


In  their  paper  of  1970,  Kimeldorf  and  Wahba  show  that  the  solution  f of  Problems 
I and  II  is  the  same  function. 

When  the  errors  in  our  observations  are  independent,  this  theorem  tells 
us  nothing  new  about  fitting  the  spline.  However,  if  we  are  forced  to  effect 
an  estimation  from  a non-independent  sample  where  the  errors  have  known  auto- 
correlation, this  result  provides  the  solution.  Since  this  question  is  periph- 
eral to  the  main  objective  of  our  account  we  now  let  the  matter  rest. 

§6  Some  General  Remarks 

(a)  A little  reflection  will  convince  the  reader  that  smoothing  spline  methods 
will  be  the  most  useful  when 
(i)  An  appropriate  parametric  model  is  not  known  and 
(ii)  High  accuracy  is  needed  »nd 
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(iii)  A considerable  amount  of  (noisy)  data  is  available. 

When  these  conditions  are  satisfied,  spline  methods  will  give  very  good 
value  for  the  computational  effort  invested. 

(b)  The  (smoothing)  spline  is  very  much  the  child  of  its  age  - the  1960 's  - 
depending  as  it  does  on  optimizatioh  theory  and  the  medium/large  computer 
for  its  implementation. 

§7  Introduction  to  the  Literature 

In  the  introduction  we  tried  to  place  spline  methods  in  the  statistical 
scheme  of  things.  Splines  will  be  seen  to  be  a departure  from  the  most  general 
non-parametric  model  back  a little  towards  the  (structure  rich)  parametric 
sltuattoh.  The  reward  for  the  changed  position  is  improved  efficiency  coupled 
Still  with  noh-parametric  integrity. 

Although  spline  functions  were  available  in  a highly  refined  form  by  the 
mid  1960's  they  were  for  some  years  largely  ignored  by  statisticians.  This 
situation  was  dramatically  changed  by  the  appearance  of  the  paper  by  Kimeldorf 
and  Wahba  (1970).  Although  the  authors'  intention  was  to  show  the  equivalence 
of  smoothing  by  splines  with  the  finding  of  a posterior  mean,  the  real  effect 
was  to  convince  statisticians  that  splines  were  effective  and  not  as  difficult 
as  they  had  thought. 

When  all  is  said  and  done,  spline  methods  are  just  a way  of  fitting  a 
smooth  curve  to  some  data.  The  curve  estimates  most  studied  in  the  statistical 
spline  literature  are  for  non-parametric  density  estimation  from  an  independent, 
identically  distributed  sample  and  for  the  estimation  of  the  spectral  density 
of  a stationary  time  series.  Splines  have  also  been  used  in  other  areas,  but 
are  at  present  more  a curiosity  than  a serious  practical  tool. 


§8  Non-Parametric  Density  Estimation 

Our  account  will  be  confined  exclusively  to  density  estimation  based  on 
an  independent  identically  distributed  sample.  There  are  two  main  routes  we 
may  follow;  we  may  use  the  empirical  distribution  function  in  some  way  or  we 
may  use  an  appropriate  analogue  of  maximum  likelihood  adapted  to  the  infinite 
dimensional  (non-parametric)  situation. 

The  empirical  density  is  very  easily  obtained  from  the  empirical  distri- 
bution when  we  use  the  Sobolev  spaces  because  of  the  following: 

Lemma:  The  solution  f of  the  problem 

...  Minimize  A dt  with  f c W 

lAj  u m 

and  f(tj)  = y.,  i = l,.,.,n 

and  the  solution  g of  the  problem 

Minimize  A dt  with  g e W , 

(B)  ^ 

and  (D  g)(t^)  = y^,  i * l,...,n 

are  related  by  Df  » g.  This  means  the  empirical  spline  fitted  density  is 
obtained  by  differentiating  the  spline  fitted  distribution  function. 

The  histosplines  described  by  Boneva,  Kendall,  and  Stefanov  (1971)  are 
empirical  densities,  in  the  nature  of  a smooth  analogue  of  a histogram,  with 
pleasant  mathematical  features.  To  make  their  analysis  feasible,  the  authors 
are  prepared  to  allow  densities  which  sometimes  take  small  negative  values  in 
a small  region. 
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Let  W^(-a>,n)  denote  the  set  of  functions  on  the  real  line  which  are,  along 
with  their  first  derivatives,  in  Ljt-”,®).  Let  S.^  denote  the  set  of  square 
summable  (double  ended)  real  sequences  with  inner  product  (h,k)  ■*  ^ *'i^i' 

Define  0:  Wj-+-  by  (Gu)^  = u(t)dt.  Now  define  an  inner  product  [u,v]  on 
Wj  by  [u,v]  » (9u,  0v)j^  ♦ u'v'  dt.  Write  Z = {u  c W^;  0u  « 0}  and 
S 3 {s  £ W^;  [s,u]  > 0 for  all  u € Z}.  Each  o c has  unique  decoiiq>osition 

0 ■ s ♦ z,  s € S,  z e Z.  Thus  we  obtain  the  projection  P:  S where  Po  = s. 

Then  the  authors  show 

1)  0 is  a 1-1  bicontinuous  map  S -►  i!,^ 

2)  S consists  of  all  s c W, : /”  s'z’dt  = 0 for  all  z e Z 

3)  For  given  h e is  the  unique  solution  of  6o  » h which 

winimizes  (a')^dt 

4)  S consists  of  those  functions  continuous  and  continuously  differenti- 
able such  that 

i)  s(t)  is  quadratic  in  each  cell 
ii)  / (s^  + s'^)dt  < « . 

The  delta-spline  is  that  function  e S which  has  (05^)^  = 1,  (05^)^^  = 0, 

1 0.  This  function  is  tabulated  explicitly  in  the  paper.  The  maneuvering 
with  Z and  the  unusual  choice  of  the  inner  product  [ ] is  rewarded  with  the 
following  result. 


Proposition;  Take  h e ^ * (hj)-  Po*"  integer  j,  let  s^  be  the  trans- 
lated delta-spline  with  (6Sj)j  ® = 0,  i / j.  Then  the  unique 

2 

histospline  o c Wj  which  has  9a  = h and  which  minimizes  / (o')  dt  is  given 
by  0 = ^ Sy  The  paper  of  Boneva,  Kendall  and  Stefanov  (1971)  also 

describes  another  histospline  and  includes  much  empirical  material  on  histo- 
spline behavior. 


Some  Femarke  on  Uietoaplinea 


1.  Once  the  tabulated  form  of  the  delta-spline  is  stored  in  the  computer 


(requiring  39  parameters)  there  is  no  explicit  optimization  required  - just  the 


grouping  of  the  data  into  classes.  Consequently  this  method  is  well  suited  to 


programmable  calculators  and  mini-computers  without  optimization  routines. 


2.  Boneva,  Kendall  and  Stefanov  (1971)  and  Schoenberg  (1972a  and  b)  also 


consider  the  variant  histospline  defined  as  the  derivative  of  that  function  G 


in  W2[0,l]  which  solves: 


Minimize  (G")^dt 


Subject  to  G(0)  = 0 and  G(ih)  = h^  i » 1,2,..., I 


where  (£+l)h  = 1 and  G*(0)  = G'(l)  = 0 . 


Yet  another  variant  of  this  problem  is  considered  by  Wahba  (1975b).  This 


involves  replacing  the  final  constraints  in  8-1  by 


G'(0)  = a^  and  G*(l)  * b^  where  a^,  bj 


are  calculated  from  the  empirical  distribution  function.  This  variant  gives 


better  accuracy  near  0 and  1 than  8.1.  When  the  criterion  is  minimum  mean 


square  error  at  a point,  Wahba  (1975b)  also  shows  how  to  choose  h optimally. 


Finally  we  note  that  if  the  problem  8.1  has  the  further  constraint 


G'(t)  > 0 for  all  t e [0,1]  the  solution  will  be  isotonic  with  respect  to  a 


natural  order  on  W2  and  will  give  a more  acceptable  density  function. 


3.  It  must  be  emphasized  that  histosplines  are  interpolating  splines  based  on 


^he  sample  histogram,  and  not  a smoothing  spline.  Consequently  in  the  presence 


of  noise  (sampling  error)  we  cannot  expect  this  method  to  be  much  better  at 


filtering  the  noise  than  the  histogram  it  is  derived  from. 


14 


This  assertion  is  supported  by  the  results  of  Wahba  (1975b)  who  shows 

for  her  variant  of  the  histospline  that  for  the  true  density  f e W and  f the 

m n 

histospline  corresponding  to  a sample  of  size  n 

E(f^(t)  - f(t))^  = 0 (n-(2m-l)/2mj 

In  a companion  paper  Wahba  (1975a)  shows  that  the  expected  mean  square  error  at 
a point  t has  that  same  order  of  magnitude  for  all  of  the  following  estimation 
methods:  the  polynomial  algorithm  (Wahba),  kernel  type  estimator  (Parzen) , 
certain  orthogonal  series  estimates  (Kronmal -Tatar) , and  the  ordinary  histogram. 
However,  the  constants  covered  by  the  0 may  be  larger  in  these  latter  cases. 

59  Densities  by  Maximum  Penalized  Likelihood 

This  area  is  realtively  unexplored  to  date.  The  analogy  with  parametric 
maximum  likelihood  estimation  gives  rise  to  the  hope  that  Maximum  Penalized 
Likelihood  Estimators  (MPLEs)  may  be  optimal  in  some  fundamental  sense.  We  now 
look  at  some  of  the  details. 

Let  n be  an  interval  (a,b)  and  let  H(J1)  be  a manifold  in  Lj(fi). 

(Manifold  = Set  of  "reasonably  similar"  functions).  Suppose  (tj,t2, . . . ,t^)  is 
a i.i.d.  sample  from  an  unknown  density  f e Lj(Jl).  Unfortunately  the  problem 

n 

Maximize  L(v)  = ] f v(t.)  subject  to  v £ H(fi) , 

9.1  i=l  ^ 

v(t)dt  = 1,  v(t)  >0  Vt  e 

will  not  have  a solution  for  most  manifolds  of  interest  (the  unimodal  or  monotone 
functions  are  an  exception) . Specifically,  any  manifold  which  contains  an 
approximating  sequence  to  any  linear  combination  of  6-functions,  admits  no 
maximum  likelihood  estimator  for  the  density  f. 
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From  heuristic  Bayesian  considerations,  Good  and  Gaskins  (1971)  suggested 
adding  a penalty  term  to  the  likelihood  which  would  penalize  unsmooth  estimates. 
They  chose  a manifold  and  penalty  function  that  lead  inevitably  to  polynomial 
splines.  Good's  and  Gaskin's  results  were  refined  and  made  rigorous  by 
de  Montricher,  Tapia  and  Thompson  (1975).  We  can  now  describe  the  current 
state  of  the  art. 

It  will  normally  be  the  case  that  the  manifold  H(n)  is  contained  in  W^(n) 
and  the  penalty  function  iKv)  = (D"'v)^dt.  Let 

n 

L(v)  = 7T  v(t.)  exp(-il>(v)) 
icl  ^ 

and  consider  the  optimization  problem 


Maximize  L(v)  subject  to 

(i)  V e H(fl) 

9.2 

(ii)  V dt  = 1 

(iii)  v(t)  ^0,  Vt  e n 

The  solution  v is  the  MPLE  of  the  underlying  density,  f. 

The  task  of  computing  the  MPL  Estimate  of  the  density  is  greatly  simpli- 
fied by  knowing  the  form  the  optimum  must  take.  The  following  existence 
theorem  is  proved  in  the  paper  by  de  Montricher,  Tapia  and  Thompson  (1975) . 

Theorem:  For  m ^ 1,  the  MPLE  corresponding  to  W^  exists,  is  unique,  and  is  a 
polynomial  spline  of  degree  2m-l.  Moreover,  if  the  estimate  is  positive  in  the 
interior  of  an  interval,  then  in  this  interval  it  is  of  degree  2m-l  and  of 
continuity  class  2m- 2 with  knots  at  the  sample  points. 


i-^n/TTrrr^-  lyw mi  ttt- 
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From  (Fisher)  information-theoretic  considerations,  as  well  as  a desire 
to  avoid  the  awkward  non-negativity  constraint  v(t)  > 0,  Good  and  Gaskins  (1971) 
also  considered  the  MPLE  problem  with  manifold 

H,(fl)  = (v:  v^  < W,  (-«>,®))  and 
9.3  ^ 2 

4>,(v)  = a /“  dt  = 4a  r (Dv*'^)^  dt  a > 0 

I •OD  V 

U 2 

tdiere  v * (v  ) is  to  be  the  (necessarily  positive)  density. 

After  noting  that  the  reformulation  trick  (9.3)  is  standard  in  the  liter- 
ature, deMontricher,  Tapia  and  Thompson  (197S)  record  conditions  for  its  valid 
use  with  the  following  lemma. 


leimi;  Let  H be  a subset  of  L2(G)  and  J a functional  on  H.  Consider 


Problem  1 


Maximize  J(v  ) subject  to 

v'*  € H,  / V dt  = 1,  v(t)  > 0 Vt 


and  Problem  II 


Maximize  J(u)  subject  to  u e H 
2 

and  / u dt  = 1 


2 

Let  u*  be  a solution  of  II.  Then  v*  = (u*)  solves  I if  and  only  if  |u*|  e H 
and  J(u*)  = J(|u*|) . 

The  authors  (deMontricher,  Tapia  and  Thompson)  go  on  to  establish  that 
the  price  of  using  the  non-negativity  trick  is  to  lose  the  polynomial  spline 
form  of  solution  - the  solution  is  an  exponential  spline  instead,  with  knots 
at  the  sample  points. 

The  paper  of  Good  and  Gaskins  (1971)  shows  how  one  might  prove  that  MPLEs 
are  weakly  consistent  and  also  gives  algorithms  and  some  empirical  material. 
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We  regretfully  record  that  on  the  two  most  important  aspects  > how  effec- 
tively noise  is  filtered  out,  and  the  asymptotic  (large  n)  performance  - the 
literature  on  MPLEs  is  quite  silent. 

§10  Noise  Filtering  by  Smoothing  Splines 

Statisticians  and  applied  mathematicians  arc  continually  faced  with  the 
problem  of  recovering  a smooth  function  when  only  noisy  measurements  of  it  are 
available.  In  fitting  a parametric  model  the  residuals  are  made  up  of  the 
noise  as  well  as  tae  deviations  of  the  model  from  the  true  function.  Smoothing 
splines  are  admirably  placed  to  estimate  this  true  function  (known  only  to  be 
smooth)  for  two  reasons.  First,  they  are  flexible  enough  to  respond  to  any 
real  local  variation,  without  allowing  pathological  behavior,  and  second,  the 
actual  degree  of  smoothing  (=  filtering  of  noise  together  with  rapid  variation) 
is  controllable.  Even  when  the  correct  degree  of  smoothing  is  unknown,  these 
features,  in  conjunction  with  a technique  called  cross-validation  (to  determine 
the  correct  degree  of  smoothing)  allow  us  to  remove  most  of  the  model  deviation 
component  from  the  residuals,  leaving  virtually  only  the  (real)  noise.  We 
presently  give  an  account  of  the  main  features  of  fitting  smoothing  spline 
functions  by  cross  validation  - full  details  are  given  by  Wahba  and  Wold 
(1975a  and  b) . 

The  model  we  are  fitting  is 

Y(t)  = f(t)  ♦ e(t)  t e [0,1]  where 

f e W and  Ee(t)  =0  all  t and 
m 

2 

Ee(s)e(t)  =0  s = t 

= 0 s / t 


10.1 


I 

1 

► 

[ 

I 
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2 

The  noise  variance  a is  generally  unknown  and  Y(t)  Is  observed  at  (an 

Increasing  set  of  points)  t^.tj, . . .,t^. 

Consider  the  problem:  Find  f e W to 
‘ in 


Minimize  (i  Z(Y(t.)  - f(t.))^  + X /J  (f^"‘^)^dt) 
where  X >0is  a fixed  real  number. 


The  first  term  is  a measure  of  the  fidelity  to  the  data,  the  second  is  X times 
the  "smoothness"  of  f.  The  optimum  solution  is  known  (Greville  (1969), 

Reinsch  (1967))  to  be  a cubic  spline  with  knots  at  the  t^,  i * l,2,...,n.  As 
X -»■  <»,  the  solution  f . approaches  its  smoothest  possible  fora  - the  least 

Tl  I A 

sqiuarti  Straight  line  through  the  data.  As  X -*■  0,  f ^ ^ approaches  the  inter- 
polating spline  through  all  of  the  data  points.  Thus  we  call  X tha  degree  of 

amoothing.  It  is  shown  in  Wahba  (1973  and  1974)  that  in  order  to  have 
W 

D 

f _ , -*■  f as  n -*■  «>  we  must  also  have  X -*■  0. 
n j A 

If  (from  previous  experience  of  our  particular  problem)  the  correct  value 
of  X is  known,  we  have  only  to  solve  10.2  using  that  X.  Unless  the  problem 
10.1  can  be  converted  to  the  periodic  smoothing  spline  form  of  Cogburn  and 
Davis  (1974)  there  is  no  simple  way  of  solving  10.2  other  than  the  usual 
optimization  routine. 

When  X is  not  known  we  can  (with  much  labor)  use  the  Cross  Validation 
Mean  Square  Error  (minimizing)  technique  to  estimate  the  appropriate  degree  of 
smoothing  from  the  data  alone.  The  method  has  been  used  successfully  in  various 
applications  by  Feinberg  and  Holland  (1972),  Hocking  (1972),  Mosteller  and 
Wallace  (1963)  and  others.  In  effect  the  CVMSE  method  gives  the  value  of 
parameter  which  maximizes  the  internal  consistency  of  the  data  set  with,  regard 
to  the  applied  model. 


Nahba  and  Wold  find  It  useful  to  recast  problem  10.2  into  a form  used  by 
Reinsch  (1967). 


1 2 

Find  f f W_  to  minimize  (f")  dt  subject  to 
10.3  ^ ” 

1 ^ 2 

— ^ (Y(t.)  - f(t.))  < S,  where  S is  given. 

" j.l  ^ J - 

It  is  well  known  that  if 

n S < inf  I (Y(t.)  - a + b t.)^ 

~ a.b  J J 

then  there  exists  a unique  A = X(S)  such  that  f . is  the  solution  to  10.2  and 

n I A 

l I - f„.X  = s. 

Armed  with  the  appropriate  tools  from  the  last  paragraph,  we  now  give  an 
account  of  Wahba  and  Wold's  Cross  Validation  procedure  (1975a). 

1)  Divide  the  data  set  into  p groups 

Croup  l:  tj,  ... 

Croup  2:  tj-  '2*p-  '2*2p’  ■" 

Croup  p;  tp,  tjp.  tjp,  ... 

2 

2)  Guess  a starting  value  for  S (Amost  invariably  S = ko  with 
0.7  < k < 1.  A reasonable  starting  point  might  be  k = 0.8). 

3)  Delete  the  first  group  of  data.  Fit  a smoothing  spline  to  the  rest 


of  the  data  using  the  method  of  Reinsch  with  the  S of  step  2.  Compute  the  sum 
of  squared  deviations  of  this  smoothing  spline  from  the  deleted  data  points. 
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4)  Delete  instead  the  second  group  of  data.  Fit  a smoothing  spline  with 
the  S of  step  2.  Compute  the  sum  of  squared  deviations  of  the  spline  from  the 
data  points. 

5)  Repeat  Step  4 for  the  3rd,  4th p^h  group  of  data. 

6)  Add  the  sums  of  squared  deviation  from  steps  3-5,  and  divide  by  n. 
This  is  the  CVMSE  for  S and  is  written  CV(S). 

7)  Determine  the  S = making  CV(S)  a minimum. 

The  smoothing  problem  10.3  can  now  be  solved  with  S = S^. 

2 

Empirical  studies  by  Wahba  and  Wold  (1975a)  indicate  that  when  a is 

2 

extremely  small,  the  CVMSE  estimate  for  k in  S = ko  has  positive  bias, 

resulting  in  very  slight  undersmoothing.  This  effect  is  negligible  for 
2 

realASlAC  sized  o , although  the  authors  do  not  present  a proof. 

§11  Periodic  Smoothing  Splines 

The  work  of  Cogbum  and  Davis  (1974)  has  been  referred  to  several  times 
already.  We  now  describe  their  results  in  detail. 

Let  G be  the  group  of  real  numbers  modulo  211  with  the  usual  topology  and 
measure.  The  model  to  be  fitted  is 

h(t)  = f(t)  + e(t)  Vt  e G 

11.1  with  f c W (G)  and  Ee(t)  =0,  Vt  c G 

m 

2 

and  Ee(s)  e(t)  = a s = t 

= 0 s t 

where  h is  observed  either  on  a lattice  of  points  or  continuously  and  the  noise 
2 

variance  a is  unknown.  The  asymptotic  solution  for  large  n devised  by  Cogburn 
and  Davis  is  very  convenient  to  handle,  and  easy  to  compute  since  it  avoids 
explicit  optimization. 
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§12  Estimatin£  Spectral  Densities 

Determining  spectral  densities  gives  spline  theory  not  only  a great 
opportunity  but  also  a severe  test.  When  the  spectrum  is  absolutely  continuous, 
spline  methods  are  extremely  effective.  However,  when  the  spectrum  has  a 
discrete  component,  i.e.  6-function  spikes,  spline  methods  based  on  have 
little  chance  of  sharply  resolving  the  spike  without  a great  deal  of  data  in 
the  vicinity  of  the  spike.  The  heart  of  the  difficulty  is  that  every  sequence 
of  functions  approximating  a 6-function  must  be  unbounded  in  norm  and  so 
also  in  norm,  and  thus  the  smoothing  spline  is  duty  bound  to  flatten  these 
real  spikes  out.  However,  before  abandoning  the  based  spaces,  we  need  to 
see  the  problem  in  perspective.  Because  of  the  superficial  similarity  between 
a 6-spike  and  a noisy  observation,  any  attempt  to  transfer  the  problem  to  a 
space  where  6-function  approximants  are  bounded  seems  doomed  to  failure  because 
we  would  be  unable  to  filter  out  the  noise  in  such  a space. 

Thus  reconciled  to  remaining  in  W^  with  its  pleasant  inner  product  and 
Fourier  Transform  structure  we  may  yet  be  able  to  find  a way  out.  One  strategy 
may  be  to  proceed  as  follows.  Since  further  data  points  are  easily  obtained 
from  the  periodogram,  it  might  be  feasible  to  use  repeated  applications  of  the 
CVMSE  method  coupled  with  a procedure  to  introduce  extra  data  points  in  regions 
where  the  rate  of  change  (of  fitted  function)  is  large. 

First  let  us  examine  the  current  state  of  the  art  for  estimating  spectral 
densities  with  spUne  methods. 

Estimating  the  Spectral  Density  of  a Statiomru  Stoohastio  Proaess 
Let  X ,X_,X_,...  be  a second  order  stationary  stochastic  process  with 

1^3 

EX^=0,  EX.X.^,=p^. 
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The  are  Fourier  coefficients  of  a symmetric  distribution  function  F on 
[-n,II],  Pjj  * /q  Cos  kw  d F(w).  When  F is  absolutely  continuous,  it  is 
completely  determined  by  the  spectral  density 


f(s)  = (DF)(w) 


f(w)  = P,,  e 


ikw 


The  statistical  problem  is  to  estimate  f(w)  on  the  basis  of  observations 

A 

Xj,X2, . . . ,X^.  Let  f(w)  denote  the  periodogram 


12.1 


n-1 

^(w)  = I Pic  ® 

-n+1 


n-k 


ikw 


P 

^0 


n-1 

^ Pj^  Cos  kw  whei'e 
k=l 


« rk 

. = - y X.X.  . , k = 0,...,n-l, 
k n j^i  j j+k* 


When  the  process  is  Gaussian  it  is  shown  in  Walker  (1965)  that 


12.2 


f(w)  = f(w)  U^(w)  + n„(w) 


where  ■>  0 in  probability  as  n -►  » and  U^(jll/n)  are  uncorrelated  exponential 
random  variables  with  a mean  and  variance  of  1 for  j = n-1. 

Since  the  periodogram  is  an  inconsistent  estimator  of  f,  some  modification  is 
required.  Smoothed  (consistent)  estimators  of  f(w)  are  obtained  either  by 
smoothing  the  periodogram 


f*(w)  = f(X)  K(w  - X)  dX 


or  by  weighting  the  covariances  by  a "lag  window"  kj^(r)  giving 


The  Pj^  are  Fourier  coefficients  of  a synmetric  distribution  function  F on 
[-n,n],  Pj^  “ ■/^Q  Cos  kw  d F{w).  When  F is  absolutely  continuous,  it  is 

completely  determined  by  the  spectral  density 


f(s)  = (DF)(w) 

QO 

\ V ikw 
f(w)  = 2.  Pj(  ® 

«00 


The  statistical  problem  is  to  estimate  f(w)  on  the  basis  of  observations 
Xj.Xj, . • . ,X^.  Let  f(w)  denote  the  periodogram 


12.1 


f(w) 


n-1 


I 

-n+1 


n-1 


p|^  Cos  kw  where 


, n-k 

p.  * i y X.X.  . , 
^ n j“l  J 2*^ 


0,...,n-l, 


When  the  process  is  Gaussian  it  is  shown  in  Walker  (1965)  that 


12.2  f(w)  = f(w)  U^(w)  + n^Cw) 

where  -►  0 in  probability  as  n ->■  » and  lJ^{jn/n)  are  uncorreiated  exponential 
random  variables  with  a mean  and  variance  of  1 for  j = 0,...,^  n-1. 

Since  the  periodogram  is  an  inconsistent  estimator  of  f,  some  modification  is 
required.  Smoothed  (consistent)  estimators  of  f(w)  are  obtained  either  by 
smoothing  the  periodogram 


f*(w)  - f(X)  K(w  - X)  dX 


or  by  weighting  the  covariances  by  a "lag  window"  kj^(r)  giving 
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f*(w) 


1 M 

— y 

2n  ^ 


r«-M 


k„(r).-™  a. 


vrfiere 


K(w)  = 


1 M 

ITT  ^ 

-M 


-irw  . , , 
e knj(r) 


Cogburn  and  Davis  (1974)  consider  estimating  f by  a CSS  or  LSS  to 

i.e.  f*"s  ,orf*t,.  They  take  L = D™  and  obtain  an  estimate  of  the 
n ^ A A 

n X a 2 

integrated  mean  square  error  for  n large.  It  is  / MSE(f(t))dt  = ^ 

f^dt  ♦ 4-  A (£(2in)j2  ^ minimizing  the  RHS  is 

-n  ,4m  -n 


f 

^ 2 
0 
m 


(^(2m))2dt/ 


1 

f^dt)^ 


and  the  resulting  MSE  is  0(i) . 

Wahba  and  Wold  (197Sb)  are  concerned  with  estimating  log  f(w)  for  spectral 
densities  f which  are  bounded  below.  Taking  the  logarithm  of  equation  12.2 
converts  the  curve  fitting  problem  to  that  of  §11. 

We  have 


Y(j)  = log  f (^)  + Y + e^,  j = ;f^l,  +2,  . . .,  +n-l 


2 

with  Ee.  * 0,  Ee.  = ^ , Y = 0.5772...  . Euler's  constant,  and  a reasonably 
J J 6 ’ ' 

symmetric  distribution  of  e^.  about  0 with  constant  variance.  In  their  paper 
(1975b)  Wahba  and  Wold  use  various  results  from  Cogburn  and  Davis  (1974)  en 
route  to  their  objective  which  is  to  show  (in  principle)  that  the  smoothing 
parameter  chosen  by  CVMSE  converges  to  the  parameter  minimizing  the  mean  squared 


I 

/ 


error . 


Wegman  (1977)  records  the  various  advantages  and  disadvantages  of  using 
log  f(w)  rather  than  f(w)  for  the  spline  fitting  model.  Although  the  fit  to 
log  f(w)  is  improved,  the  kernel  interpretation  and  the  attendant  results  on 
consistency  are  lost.  For  multiple  time  series  (with  which  Wegman  was 
concerned)  various  functions  such  as  transfer  function,  multiple  coherence  and 
coherency  fit  the  spline  model  directly  via  log  > log 

each  of  the  above  functions  may  be  estimated  directly  with  a spline,  rather 
than  as  products  and  quotients  of  spline  estimates. 

§13  Isotonic  Splines 

There  are  many  model  fitting  problems  where  we  either  have  some  prior 
knowle4ge  of  the  form  the  solution  must  take,  or  else  have  some  insight  into 
the  laws  governing  the  system.  This  knowledge  may  be  equivalent  to  ti»e 
requirement  that  the  fitted  function  preserve  some  order  on  the  data  points  - 
and  obvious  example  is  a fitted  distribution  function  - this  must  satisfy 
F(»)  > F(y)  whenever  y.  Functions  which  preserve  (in  some  sense)  an  order 
relation  of  their  requirements  are  called  isotone.  Knowledge  of  isotonicity 
may  follow  from  very  elementary  considerations. 

Some  important  isotone  families  of  functions  are  the  monotone  functions, 
the  convex  functions,  the  positive  functions  and  the  unimodal  functions. 

The  main  virtues  of  isotonic  splines  are  that  they  are  locally  very 
flexible  in  one  direction  for  following  the  true  underlying  function  and 
exceedingly  stiff  in  the  other  direction  so  as  to  filter  out  the  noise.  The 
convex  functions  for  example  can  easily  bend  upwards  but  not  down.  Preliminary 
studies  suggest  this  filtering  is  spectacularly  effective  when  the  noise  vari- 
ance is  large.  A general  account  of  isotonic  splines  is  given  in  Wright  (1977) 
and  a more  statistically  oriented  account  in  the  paper  by  Wright  and  Wegman  (1977). 
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